Data Analysis: RecipeDB

  • Exploratory Data Analysis
  • Principal Component Analysis
  • Correlation Coefficients
  • Network Analysis
    • VoteRank Algorithm
    • Backbone Extraction
    • Community Detection
In [1]:
#load libraries
import numpy as np
import pandas as pd 
import csv

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import plotly
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx
from pyvis.network import Network

Reading the Dataset


Our dataset contains information about 118171 recipes and 150 corresponding nutrient values.

In [2]:
df = pd.read_csv('../RecipeDB/Cosy_Drive/Recipes(6).csv',low_memory=False)
df = pd.DataFrame(df)

#columns 9 to 158 contain the required information
data = df.iloc[:,9:159]
data = data.fillna(0)
data.head()
Out[2]:
Adjusted Protein (g) Alanine (g) Alcohol, ethyl (g) Arginine (g) Ash (g) Aspartic acid (g) Beta-sitosterol (mg) Betaine (mg) Caffeine (mg) Calcium, Ca (mg) ... Vitamin C, total ascorbic acid (mg) Vitamin D (D2 + D3) (g) Vitamin D (IU) Vitamin D2 (ergocalciferol) (g) Vitamin D3 (cholecalciferol) (g) Vitamin E (alpha-tocopherol) (mg) Vitamin E, added (mg) Vitamin K (phylloquinone) (g) Water (g) Zinc, Zn (mg)
0 0.0 2.1828 0.0 3.9707 8.3230 5.6772 90.24 0.6844 0.0 290.7070 ... 20.1579 0.0000 0.0000 0.0 0.0000 1.0203 0.0 19.4893 908.8117 7.7139
1 0.0 0.0820 0.0 0.2266 4.5639 0.3773 0.00 0.2780 0.0 73.1950 ... 21.0450 0.0000 0.0000 0.0 0.0000 2.0730 0.0 8.6692 603.0612 1.2356
2 0.0 4.0039 0.0 5.5042 6.3193 0.4877 0.00 0.2040 0.0 338.2219 ... 46.7970 0.4520 13.5600 0.0 0.4520 10.4014 0.0 175.3201 1151.3867 16.0640
3 0.0 2.9527 0.0 6.1530 9.9426 7.7498 0.00 4.0600 0.0 324.1933 ... 19.5802 0.0375 1.5375 0.0 0.0375 6.8426 0.0 44.9732 123.2322 9.5153
4 0.0 1.2022 0.0 3.2117 8.7440 2.1786 0.00 1.2282 0.0 747.7540 ... 323.0240 0.0000 0.0000 0.0 0.0000 0.5431 0.0 23.2386 123.8927 6.7782

5 rows × 150 columns

Exploratory Data Analysis


1. Continent Distribution

  • 40.3% of the recipes are European
  • 3.96% of the recipes are African
In [3]:
#count the Continent distribution
continents_count = df['Continent'].value_counts()
print(continents_count)

#plot a pie chart
fig = px.pie(values=continents_count,names=continents_count.index, title='Continents Distribution')
fig.show()
European          47622
Latin American    25125
Asian             23194
North American    11731
Australasian       5823
African            4676
Name: Continent, dtype: int64

2. Average Calorie Count by Continent

  • European food has the highest average calorie count (521.8 cal)
  • North American food has the lowest average calorie count (408.433 cal)
In [4]:
#calculate average calories by continent
calories_continent = df.groupby(['Continent']).mean().Calories
val = calories_continent.sort_values(ascending=False)

fig = px.bar(val, val.index, val, title='Average Calorie Count by Continent')
fig.show()

3. Average Calorie Count by Country

We have data from 75 countries.

  • Nepalese (813.57 cal) and Irish (813.05 cal) food have the highest average calorie counts.
  • Israeli food (256.8 cal) has the lowest average calories.
In [5]:
print('Countries:', len(np.unique(df['Sub_region'])), np.unique(df['Sub_region']))

#find mean calories grouped by 'Sub_region'
calories_country = df.groupby(['Sub_region']).mean().Calories
val = calories_country.sort_values(ascending=False)

fig = px.bar(val, val.index, val, title='Average Calorie Count by Country')
fig.show()
Countries: 75 ['Angolan' 'Argentine' 'Australian' 'Austrian' 'Bangladeshi' 'Belgian'
 'Brazilian' 'Cambodian' 'Canadian' 'Chilean' 'Chinese' 'Colombian'
 'Costa Rican' 'Cuban' 'Czech' 'Danish' 'Dutch' 'Ecuadorean' 'Egyptian'
 'English' 'Ethiopian' 'Filipino' 'Finnish' 'French' 'German' 'Greek'
 'Guatemalan' 'Honduran' 'Hungarian' 'Icelandic' 'Indian' 'Indonesian'
 'Iraqi' 'Irish' 'Israeli' 'Italian' 'Jamaican' 'Japanese' 'Korean'
 'Laotian' 'Lebanese' 'Libyan' 'Malaysian' 'Mexican' 'Mongolian'
 'Moroccan' 'Namibian' 'Nepalese' 'New Zealander' 'Nigerian' 'Norwegian'
 'Pakistani' 'Palestinian' 'Peruvian' 'Polish' 'Portuguese' 'Puerto Rican'
 'Rest Caribbean' 'Rest Eastern European' 'Rest Middle Eastern' 'Russian'
 'Saudi Arabian' 'Scottish' 'Somalian' 'Spanish' 'Sudanese' 'Swedish'
 'Swiss' 'Thai' 'Turkish' 'UK' 'US' 'Venezuelan' 'Vietnamese' 'Welsh']

4. Food Items per Category

Our dataset consists of 30 different food categories, such as vegetables, meat, and nuts and seeds.

Vegetables have the highest count (14.9%), followed by spices (11.8%), while the 'dish' category has the lowest count (0.0019%).

In [25]:
data2 = pd.read_csv('../RecipeDB/Cosy_Drive/recipe_with_category.csv')

#count the Food Category distribution
food_cat_count = data2['Dietrx_Category'].value_counts()
print(len(food_cat_count))
print(food_cat_count)

#plot a pie chart
fig = px.pie(values=food_cat_count,names=food_cat_count.index, title='Food Category Distribution')
fig.show()
30
Vegetable             172117
Spice                 136240
Herb                  114475
Dairy                 107689
Meat                   96332
Additive-Salt          55574
Plant Derivative       53588
Cereal                 49019
Fruit                  46882
Additive-Sugar         41991
Condiment              40663
Additive               39633
Beverage               27342
Beverage-Alcoholic     20656
Bakery                 20513
Legume                 20471
Nuts and Seeds         15970
Plant                  15608
Dish                   15215
Maize                  12386
Additive-Vinegar       10529
Seafood                10083
Fungi                   8382
Essential Oil           6388
Fish                    6233
Flower                  3585
Additive-Yeast          2785
MISC-Utensil             480
MISC-Other               451
dish                      22
Name: Dietrx_Category, dtype: int64

Principal Component Analysis


Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a large set of features or variables into a smaller set, while simultaneously preserving as much information as possible.

Here, we have used PCA to detect outliers amongst all the recipes based on their nutritional values.
The recipes have been categorised into 6 continents: Europe, Asia, Africa, Latin America, North America and Australasia.
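Before applying PCA to the full nutrient table, its mechanics can be sketched on synthetic data. This is a minimal illustration (the data below is randomly generated, not from RecipeDB): two of the three features are strongly correlated, so the first principal component should absorb most of the variance, which `explained_variance_ratio_` confirms.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# toy data: two strongly correlated features plus one independent feature,
# so most of the variance lies along a single direction
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# scale, then project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(X))

print(components.shape)               # one 2-d point per sample
print(pca.explained_variance_ratio_)  # first component dominates
```

The same `explained_variance_ratio_` attribute can be inspected on the fitted recipe PCA to see how much information the two components retain.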

In [7]:
Y = df['Continent']

#scale the data
scaled_data = preprocessing.scale(data)

#create two principal components
pca = PCA(n_components=2)
pca.fit(scaled_data)
components = pca.transform(scaled_data)

#to plot
fig = px.scatter(components, x=0, y=1, color=df['Continent'], hover_name=df['Recipe_title'], title='PCA - Recipes')

#save the plot
plotly.offline.plot(fig, filename='pca_recipes.html')

fig.show()

The PCA plot reveals 3 prominent outliers:

I. Jessica's Mauritian Chicken Curry

In [8]:
idx = df[df['Recipe_title']=="Jessica's Mauritian Chicken Curry"].index.values
data.iloc[idx]
Out[8]:
Adjusted Protein (g) Alanine (g) Alcohol, ethyl (g) Arginine (g) Ash (g) Aspartic acid (g) Beta-sitosterol (mg) Betaine (mg) Caffeine (mg) Calcium, Ca (mg) ... Vitamin C, total ascorbic acid (mg) Vitamin D (D2 + D3) (g) Vitamin D (IU) Vitamin D2 (ergocalciferol) (g) Vitamin D3 (cholecalciferol) (g) Vitamin E (alpha-tocopherol) (mg) Vitamin E, added (mg) Vitamin K (phylloquinone) (g) Water (g) Zinc, Zn (mg)
16780 0.0 4469.5966 0.0 4659.4356 3416.9517 4874.4803 0.0 56427.9432 0.0 46447.7737 ... 1870.8549 1359.0 40770.0 0.0 1359.0 1845.7586 0.0 2873.7468 326987.2132 4458.7217

1 row × 150 columns

II. Malai Seekh Kebab

In [9]:
idx = df[df['Recipe_title']=="Malai Seekh Kebabs For Iftar"].index.values
data.iloc[idx]
Out[9]:
Adjusted Protein (g) Alanine (g) Alcohol, ethyl (g) Arginine (g) Ash (g) Aspartic acid (g) Beta-sitosterol (mg) Betaine (mg) Caffeine (mg) Calcium, Ca (mg) ... Vitamin C, total ascorbic acid (mg) Vitamin D (D2 + D3) (g) Vitamin D (IU) Vitamin D2 (ergocalciferol) (g) Vitamin D3 (cholecalciferol) (g) Vitamin E (alpha-tocopherol) (mg) Vitamin E, added (mg) Vitamin K (phylloquinone) (g) Water (g) Zinc, Zn (mg)
35484 0.0 9820.635 0.0 10890.9515 7062.7184 15611.2014 0.0 169503.3734 0.0 50072.9691 ... 142.2685 1.0 41.0 0.0 1.0 4000.6548 0.0 1.6328 271198.092 29651.756

1 row × 150 columns

III. Stifado

In [10]:
idx = df[df['Recipe_title']=="Stifado (Traditional Greek Stew)"].index.values
data.iloc[idx]
Out[10]:
Adjusted Protein (g) Alanine (g) Alcohol, ethyl (g) Arginine (g) Ash (g) Aspartic acid (g) Beta-sitosterol (mg) Betaine (mg) Caffeine (mg) Calcium, Ca (mg) ... Vitamin C, total ascorbic acid (mg) Vitamin D (D2 + D3) (g) Vitamin D (IU) Vitamin D2 (ergocalciferol) (g) Vitamin D3 (cholecalciferol) (g) Vitamin E (alpha-tocopherol) (mg) Vitamin E, added (mg) Vitamin K (phylloquinone) (g) Water (g) Zinc, Zn (mg)
78586 0.0 6246.5743 0.0 7266.6385 4047.0674 9951.5176 0.0 88129.6383 0.0 49196.4542 ... 73.0235 408.0009 24480.0555 0.0 408.0009 544.0765 0.0 7280.5232 242133.6913 40190.8205

1 row × 150 columns

These dishes contain very high amounts of nearly every nutrient compared to the rest of the recipes, and hence appear as outliers.

Correlation Coefficients


Correlation coefficients measure the strength of the relationship between two variables. A positive correlation coefficient indicates a positive relationship between the variables, a negative value signifies a negative relationship, and a value of 0 indicates no relationship.

Here, the nutrients are the variables.
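A minimal sketch of what `corr()` computes, on a toy table (the column names and values below are illustrative, not from RecipeDB): one column rises with protein and one falls, producing coefficients near +1 and -1 respectively.

```python
import pandas as pd

# toy nutrient table (illustrative values only)
toy = pd.DataFrame({
    'protein': [10.0, 20.0, 30.0, 40.0],
    'leucine': [1.0, 2.1, 2.9, 4.0],      # rises with protein -> corr near +1
    'water':   [90.0, 80.0, 70.0, 60.0],  # falls as protein rises -> corr = -1
})

# pairwise Pearson correlation coefficients
corr_mat = toy.corr()
print(corr_mat.round(3))
```

The recipe analysis below does the same thing across all 150 nutrient columns.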

Correlation Matrix

The correlation matrix holds the correlation coefficient for every pair of variables.
Pairs with coefficients close to 1.0 or -1.0 have a strong relationship, while values close to 0 show no meaningful relationship.

In [11]:
#Correlation Matrix

#corr() computes the correlation coefficients
corr_mat = data.corr()

#to plot the correlation matrix
fig = px.imshow(corr_mat, color_continuous_scale='RdBu_r', title='Correlation Matrix - Nutrients')
fig.show()

In our dataset:

  • Threonine & Leucine (corr coeff = 0.99669) and Valine & Leucine (corr coeff = 0.99657) have the highest correlation coefficient values, indicating a strong positive relationship.
  • Vitamin D2 (ergocalciferol) & Manganese have the weakest relationship (corr coeff = 0.00000134), showing almost no correlation.

Saving Correlation Coefficients

We now create csv files that store the positively and negatively correlated pairs of variables, in decreasing order of correlation strength.

In [12]:
#Calculate the Correlation Coefficients

corr_pairs = corr_mat.unstack()

#sort the values using Quick Sort
sorted_pairs = corr_pairs.sort_values(kind="quicksort")

A. Positive Correlations

Pairs of nutrients that have a positive correlation will be saved in pos_pairs.csv.
There are 10342 pairs of nutrients with positive correlation coefficients.

In [13]:
#find pairs with positive correlation coefficients
pos_pairs = sorted_pairs[sorted_pairs>0]

#store the calculated values in a csv file 'pos_pairs.csv'
csvFile = open('pos_pairs.csv', 'w', newline='')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['Nutrient1','Nutrient2','Value'])

#the correlation matrix is symmetric, so step by 2 to avoid storing each pair twice
for i in range(len(pos_pairs.index)-1,-1,-2):
    csvWriter.writerow([pos_pairs.index[i][0],pos_pairs.index[i][1],pos_pairs.iloc[i]])
csvFile.close()

data = pd.read_csv('pos_pairs.csv') 
print(len(data))
10342
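The step-of-2 loop above relies on the sorted series placing each pair next to its mirror. A more direct way to drop the symmetric duplicates is to mask everything except the upper triangle of the matrix before stacking. A minimal sketch on a toy frame (column names and values illustrative):

```python
import numpy as np
import pandas as pd

# toy data standing in for the nutrient table
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [2, 4, 6, 8],
                    'c': [4, 3, 2, 1]})
corr_mat = toy.corr()

# keep only the upper triangle (diagonal excluded), so each unordered
# pair appears exactly once, then stack into long form and sort
mask = np.triu(np.ones(corr_mat.shape, dtype=bool), k=1)
pairs = corr_mat.where(mask).stack().sort_values(ascending=False)
pairs.index.names = ['Nutrient1', 'Nutrient2']
print(pairs)

pos_pairs = pairs[pairs > 0]
neg_pairs = pairs[pairs < 0]
```

`pairs.reset_index()` could then be written out with `to_csv`, avoiding the manual `csv.writer` loop entirely.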

B. Negative Correlations

Pairs of nutrients that have a negative correlation will be saved in neg_pairs.csv.
There are 667 pairs of nutrients with negative correlation coefficients.

In [14]:
#find pairs with negative correlation coefficients
neg_pairs = sorted_pairs[sorted_pairs<0]

#store the calculated values in a csv file 'neg_pairs.csv'
csvFile = open('neg_pairs.csv', 'w', newline='')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['Nutrient1','Nutrient2','Value'])

#the correlation matrix is symmetric, so step by 2 to avoid storing each pair twice
for i in range(0, len(neg_pairs.index),2):
    csvWriter.writerow([neg_pairs.index[i][0],neg_pairs.index[i][1],neg_pairs.iloc[i]])
csvFile.close()

data = pd.read_csv('neg_pairs.csv') 
print(len(data))
667

There are more positively correlated pairs than negatively correlated pairs.

Weighted Networks


Ingredient - Ingredient Graph

We will now create a weighted graph between ingredients, where the weights represent the number of shared recipes. To do so, we build a 2-dimensional dictionary mapping each ingredient to the recipes it appears in.
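The idea can be sketched on a toy long-format table of (recipe, ingredient) rows, mirroring the shape of Recipe_correct_ndb.csv (the recipe numbers and ingredient names below are illustrative): group ingredients by recipe, then count how many recipes each unordered pair shares.

```python
from itertools import combinations
from collections import Counter

# toy long-format rows: (recipe_no, ingredient)
rows = [(1, 'salt'), (1, 'garlic'), (1, 'onion'),
        (2, 'salt'), (2, 'garlic'),
        (3, 'garlic'), (3, 'onion')]

# group the ingredients of each recipe
recipes = {}
for rec, ing in rows:
    recipes.setdefault(rec, set()).add(ing)

# count shared recipes for every unordered ingredient pair
weights = Counter()
for ings in recipes.values():
    for a, b in combinations(sorted(ings), 2):
        weights[(a, b)] += 1

print(weights)  # ('garlic', 'salt') and ('garlic', 'onion') each share 2 recipes
```

The notebook cell below builds the same pair weights, but via an index dictionary and an explicit adjacency matrix over the full dataset.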

In [15]:
#read the dataset
data = pd.read_csv("../RecipeDB/Cosy_Drive/Recipe_correct_ndb.csv")

data = pd.DataFrame(data)
data_dict = data.to_dict()

ingr_arr = []
reci_arr = []
for i in range(len(data)):
    ing = data_dict['ingredient'][i]
    rec = data_dict['recipe_no'][i]
    ingr_arr.append(ing)
    reci_arr.append(rec)

reci_arr = list(set(reci_arr))
ingr_arr = list(set(ingr_arr))

#2d-dictionary to create the graph
dict_ingrs = {}
temp_ing = []
for i in range(len(ingr_arr)):
    if(ingr_arr[i] not in temp_ing):
        dict_ingrs[ingr_arr[i]] = i
        temp_ing.append(ingr_arr[i])

f = data

#m stores the max of len(reci_arr) and len(ingr_arr) to size the weight matrix
m = max(len(reci_arr), len(ingr_arr))

source_dict = dict()
for ind in f.index:
    source = dict_ingrs[f['ingredient'][ind]]
    recipe_no = f["recipe_no"][ind]
    
    #if two ingredients have a common recipe
    if source in source_dict:
        if recipe_no not in source_dict[source]:
            source_dict[source][recipe_no] = ""
    else:
        source_dict[source] = dict()
        source_dict[source][recipe_no] = ""

weights = []
for i in range(m+1):
    temp = [0]*(m+1)
    weights.append(temp)

for ingredient in source_dict:
    for source2 in source_dict:
        count = 0
        if ingredient!=source2:
            for recipe_no in source_dict[ingredient]:
                if recipe_no in source_dict[source2]:
                    count += 1
        weights[ingredient][source2] = count

Saving in a csv file

The file 'ingredient_weights.csv' contains the weighted graph.

In [16]:
#to save the graph in a csv file 
csv_file = open("ingredient_weights.csv",'w',newline='')
csv_writer = csv.writer(csv_file)

#the weight matrix is indexed by ingredient, so iterate over ingr_arr
for i in range(len(ingr_arr)):
    for j in range(len(ingr_arr)):
        if(weights[i][j]>0 and i!=j):
            if(str(ingr_arr[i])!='nan' and str(ingr_arr[j])!='nan'):
                csv_writer.writerow([ingr_arr[i],ingr_arr[j],weights[i][j]])
csv_file.close()

VoteRank Algorithm


The VoteRank algorithm finds the most influential nodes in a network, using metrics such as a node's location in the network, its degree, and edge weights.

Top 5 influential/key ingredients in our network:

  1. garlic
  2. water
  3. salt
  4. onion
  5. vegetable oil


Reference: Zhang, J.X., Chen, D.B., Dong, Q. and Zhao, Z.D., 2016. Identifying a set of influential spreaders in complex networks. Scientific reports, 6, p.27823.
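Before running VoteRank on the full ingredient network, its behaviour can be seen on a toy graph (node names illustrative): a hub connected to every other node collects the most votes and is elected first.

```python
import networkx as nx

# toy network: 'hub' touches every other node, so VoteRank elects it first
G = nx.Graph()
G.add_edges_from([('hub', n) for n in ['a', 'b', 'c', 'd']])
G.add_edges_from([('a', 'b'), ('c', 'd')])

# each node votes for its neighbours; elected nodes stop voting and
# their neighbours' voting ability is weakened in later rounds
influential = nx.voterank(G)
print(influential)
```

The cell below applies the same `networkx` implementation to the ingredient co-occurrence graph.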

In [17]:
#read the list
G = nx.read_edgelist('../RecipeDB/backbone_extraction/ingredient_weights.csv', delimiter=',' ,encoding='latin1',create_using=nx.Graph(),nodetype=str,data=(('weight',int),))

csv_file = open('voteRank_ingredient.csv','w',newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Nodes'])

#use the nx library voterank
voteRankList = nx.algorithms.centrality.voterank(G)

for i in voteRankList:
    csv_writer.writerow([i])

print(voteRankList[:5])
['garlic', 'water', 'salt', 'onion', 'vegetable oil']

The file 'voteRank_ingredient.csv' stores the list of most influential ingredients based on shared recipes.

Backbone Extraction


To extract the backbone structure of an undirected weighted network, we use the Disparity Filter, a network reduction algorithm.

The disparity filter identifies the links that should be preserved in the network.
The null hypothesis is: the normalized weights that correspond to the connections of a node of degree k are produced by a random assignment from a uniform distribution.
By imposing a significance level $\alpha$, the links whose weights are compatible with this random assignment can be filtered out, preserving the rest with a certain statistical significance.

The statistically relevant edges will be those whose weights satisfy the relation:
$\alpha_{ij} = 1 - (k - 1) \int_{0}^{p_{ij}}(1 - x)^{k-2}dx < \alpha$
where $k$ is the number of connections of the node to which the link under consideration is attached,
$p_{ij}$ is the normalised weight, and
$\alpha_{ij}$ is the probability that the normalised weight $p_{ij}$ is compatible with the null hypothesis.

The multiscale backbone is then obtained by preserving all the links that satisfy the above criterion for at least one of the two nodes at the ends of the link, while discarding the rest.
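For $k > 1$ the integral in the criterion has the closed form $\alpha_{ij} = (1 - p_{ij})^{k-1}$, which is also how the code below evaluates it numerically. A quick check with arbitrary sample values (the $k$ and $p_{ij}$ chosen here are illustrative):

```python
from scipy import integrate

# sample degree and normalised weight (illustrative values)
k, p_ij = 5, 0.3

# numeric evaluation of the criterion, as used in the disparity filter code
alpha_numeric = 1 - (k - 1) * integrate.quad(lambda x: (1 - x) ** (k - 2), 0, p_ij)[0]

# closed-form equivalent: (1 - p_ij)^(k - 1)
alpha_closed = (1 - p_ij) ** (k - 1)

print(alpha_numeric, alpha_closed)  # the two agree
```

A small $\alpha_{ij}$ means the observed weight is unlikely under the null hypothesis, so the edge is kept when $\alpha_{ij} < \alpha$.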


Reference: M. A. Serrano et al. (2009) Extracting the Multiscale backbone of complex weighted networks. PNAS, 106:16, pp. 6483-6488.
The following code has been taken from: aekpalakorn/python-backbone-network/blob/master/backbone.py

In [18]:
'''
This module implements the disparity filter to compute a significance score of edge weights in networks
'''

import networkx as nx
import numpy as np
from scipy import integrate
import matplotlib.pyplot as plt


def disparity_filter(G, weight='weight'):
    ''' Compute significance scores (alpha) for weighted edges in G as defined in Serrano et al. 2009
        Args
            G: Weighted NetworkX graph
        Returns
            Weighted graph with a significance score (alpha) assigned to each edge
        References
            M. A. Serrano et al. (2009) Extracting the Multiscale backbone of complex weighted networks. PNAS, 106:16, pp. 6483-6488.
    '''
    
    if nx.is_directed(G): #directed case    
        N = nx.DiGraph()
        for u in G:
            
            k_out = G.out_degree(u)
            k_in = G.in_degree(u)
            
            if k_out > 1:
                sum_w_out = sum(np.absolute(G[u][v][weight]) for v in G.successors(u))
                for v in G.successors(u):
                    w = G[u][v][weight]
                    p_ij_out = float(np.absolute(w))/sum_w_out
                    alpha_ij_out = 1 - (k_out-1) * integrate.quad(lambda x: (1-x)**(k_out-2), 0, p_ij_out)[0]
                    N.add_edge(u, v, weight = w, alpha_out=float('%.4f' % alpha_ij_out))
                    
            elif k_out == 1 and G.in_degree(list(G.successors(u))[0]) == 1:
                #we need to keep the connection as it is the only way to maintain the connectivity of the network
                #note: successors() returns an iterator in networkx >= 2.0, so wrap it in list()
                v = list(G.successors(u))[0]
                w = G[u][v][weight]
                N.add_edge(u, v, weight = w, alpha_out=0., alpha_in=0.)
                #there is no need to do the same for the k_in, since the link is built already from the tail
            
            if k_in > 1:
                sum_w_in = sum(np.absolute(G[v][u][weight]) for v in G.predecessors(u))
                for v in G.predecessors(u):
                    w = G[v][u][weight]
                    p_ij_in = float(np.absolute(w))/sum_w_in
                    alpha_ij_in = 1 - (k_in-1) * integrate.quad(lambda x: (1-x)**(k_in-2), 0, p_ij_in)[0]
                    N.add_edge(v, u, weight = w, alpha_in=float('%.4f' % alpha_ij_in))
        return N
    
    else: #undirected case
        B = nx.Graph()
        for u in G:
            k = len(G[u])
            if k > 1:
                sum_w = sum(np.absolute(G[u][v][weight]) for v in G[u])
                for v in G[u]:
                    w = G[u][v][weight]
                    p_ij = float(np.absolute(w))/sum_w
                    alpha_ij = 1 - (k-1) * integrate.quad(lambda x: (1-x)**(k-2), 0, p_ij)[0]
                    B.add_edge(u, v, weight = w, alpha=float('%.4f' % alpha_ij))
        return B

def disparity_filter_alpha_cut(G,weight='weight',alpha_t=0.4, cut_mode='or'):
    ''' Performs a cut of the graph previously filtered through the disparity_filter function.
        
        Args
        ----
        G: Weighted NetworkX graph
        
        weight: string (default='weight')
            Key for edge data used as the edge weight w_ij.
            
        alpha_t: double (default='0.4')
            The threshold for the alpha parameter that is used to select the surviving edges.
            It has to be a number between 0 and 1.
            
        cut_mode: string (default='or')
            Possible strings: 'or', 'and'.
            It works only for directed graphs. It represents the logic operation to filter out edges
            that do not pass the threshold value, combining the alpha_in and alpha_out attributes
            resulting from the disparity_filter function.
            
            
        Returns
        -------
        B: Weighted NetworkX graph
            The resulting graph contains only edges that survived from the filtering with the alpha_t threshold
    
        References
        ---------
        .. M. A. Serrano et al. (2009) Extracting the Multiscale backbone of complex weighted networks. PNAS, 106:16, pp. 6483-6488.
    '''    
    
    if nx.is_directed(G):#Directed case:   
        B = nx.DiGraph()
        for u, v, w in G.edges(data=True):
            try:
                alpha_in =  w['alpha_in']
            except KeyError: #there is no alpha_in, so we assign 1. It will never pass the cut
                alpha_in = 1
            try:
                alpha_out =  w['alpha_out']
            except KeyError: #there is no alpha_out, so we assign 1. It will never pass the cut
                alpha_out = 1  
            
            if cut_mode == 'or':
                if alpha_in<alpha_t or alpha_out<alpha_t:
                    B.add_edge(u,v, weight=w[weight])
            elif cut_mode == 'and':
                if alpha_in<alpha_t and alpha_out<alpha_t:
                    B.add_edge(u,v, weight=w[weight])
        return B

    else:
        B = nx.Graph()#Undirected case:   
        for u, v, w in G.edges(data=True):
            
            try:
                alpha = w['alpha']
            except KeyError: #there is no alpha, so we assign 1. It will never pass the cut
                alpha = 1
                
            if alpha<alpha_t:
                B.add_edge(u,v, weight=w[weight])
        return B          

We will now read the file ingredient_weights.csv, which stores the weighted graph of ingredients sharing common recipes.

In [19]:
if __name__ == '__main__':

    #read the file
    G = nx.read_edgelist('ingredient_weights.csv', delimiter=',' , encoding='latin1',create_using=nx.Graph(),nodetype=str,data=(('weight',int),))
    alpha = 0.05
    
    #use the disparity filter algorithm
    G = disparity_filter(G)
    G2 = nx.Graph([(u, v, d) for u, v, d in G.edges(data=True) if d['alpha'] < alpha])
    print('alpha =', alpha)
    print('The extracted backbone structure contains: \n{0:.2f}% of the original nodes \n{1:.2f}% of the original edges'.format((G2.number_of_nodes()/G.number_of_nodes())*100, (G2.number_of_edges()/G.number_of_edges())*100))
alpha = 0.05
The extracted backbone structure contains: 
21.13% of the original nodes 
5.85% of the original edges

If we take $\alpha = 0.05$, the extracted backbone structure contains:

  • 21.13% of the original nodes
  • 5.85% of the original edges

Thus, we can significantly reduce the size of our network while preserving its backbone structure.

Visualisation of the Backbone Structure

We use the pyvis library to create and save the interactive backbone graph as backbone_ingredient.html.

In [30]:
#use the pyvis library to visualise our graph
nt = Network(width='100%',height='85%', font_color="white", heading='Backbone Structure for Ingredients')
nt.from_nx(G2)
nt.barnes_hut()

#save the graph as the file 'backbone_ingredient.html'
nt.show('backbone_ingredient.html')

We then load the saved file in the notebook using the IPython library.

Note: Zoom into the plot to see the label of each node.

In [21]:
#load the saved graph
nt.prep_notebook()
from IPython.display import IFrame
IFrame('backbone_ingredient.html', width=800, height=600)
Out[21]:

Community Detection


Community detection is a method used to find groups or clusters of nodes within a network.

We have used the Louvain algorithm for community detection in our ingredients network. This algorithm uses a quantity called modularity to extract communities or groups. Modularity measures the strength of division of a network into modules (groups): high modularity means dense connections between the nodes within a module and sparse connections between nodes in different modules. The Louvain algorithm is a greedy optimisation method that tries to maximise the modularity of the partition.
It evaluates how densely connected the nodes in a partition are, merges each community into a single node, and repeats the modularity optimisation on the condensed graph.

For this, we use the python library python-louvain, imported as community_louvain.
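The algorithm's behaviour can be sketched on a toy graph (node numbers illustrative) using the Louvain implementation that ships with networkx itself (available from networkx 2.8 onwards, alongside the python-louvain package used below): two dense triangles joined by a single bridge edge should be recovered as two communities.

```python
import networkx as nx

# two triangles joined by one bridge edge; Louvain should split them apart
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2),   # triangle 1
                  (3, 4), (3, 5), (4, 5),   # triangle 2
                  (2, 3)])                  # bridge

# maximise modularity greedily; seed fixes the (randomised) node order
communities = nx.community.louvain_communities(G, seed=42)
print(communities)
```

On this graph the two-triangle split has modularity about 0.36, versus 0 for a single community, so the algorithm prefers it.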

In [22]:
import community as community_louvain
import matplotlib.cm as cm
import networkx as nx

#read the graph
G = nx.read_edgelist('ingredient_weights.csv', delimiter=',' ,encoding='latin1', create_using=nx.Graph(), nodetype=str, data=(('weight',int),))

# compute the best partition
partition = community_louvain.best_partition(G)

Visualising the Community Structure

In [23]:
# draw the graph
pos = nx.spring_layout(G, k=0.25)

plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Community Detection - Louvain Algorithm')

# color the nodes according to their partition
cmap = cm.get_cmap('Set1', max(partition.values()) + 1)
nx.draw_networkx_nodes(G, pos, partition.keys(), node_size=40, cmap=cmap, node_color=list(partition.values()))
nx.draw_networkx_edges(G, pos, alpha=0.5)

plt.show()

Saving the Communities

We detect 10 communities in our network. The ingredients, along with their community numbers, are stored in the file community_lists.csv.

In [24]:
#create the csv file
csv_file = open('community_lists.csv','w',newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Name','Category'])

arr_cats = []
for x,y in partition.items():
  arr_cats.append(y)
  csv_writer.writerow([x,y])

csv_file.close()

#save the ingredients with their community number
data = pd.read_csv('community_lists.csv')
data = data.sort_values(["Category"])
data.to_csv('community_lists.csv', index=False)

print('Number of communities detected:', len(set(arr_cats)))
Number of communities detected: 10